Prototyping a data extraction pipeline for bluesky.social and exploration of bluesky user activity for influenza-like digital disease detection
Digital Epidemiology 2025, Hasselt University
2025-04-10
bluesky social network
bluesky API
bluesky messages
bluesky: general aspects
twitter-like user experience
Decentralized User Identifier (DID)
Personal Data servers (PDS)
DIDs and affiliated content are portable between PDSs
Users can choose, prioritize and develop feed generators and content labelers
twitter by Elon Musk
X in Brazil, presidential election in the US
bluesky
Google Scholar search: “bluesky” AND “social” since 2022
43 articles
main topics:
X to bluesky 2024
no results for
searchPosts API method
selected parameters:
q: search query
since, until: defining search period
limit: max. 100 posts
deterministic search
allows exhaustive sampling
defined in the SDK documentation
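A minimal sketch of how exhaustive sampling could work with a page limit of 100 posts, assuming each response carries a `cursor` that points to the next page. The `fetch_page` callable and `fake_fetch` are hypothetical stand-ins for the real API call, so the paging logic can be run offline:

```python
from typing import Callable, Dict, Iterator, Optional

# A page fetcher takes the query parameters and returns one response page.
# Assumed response shape: a "posts" list plus an optional "cursor".
PageFetcher = Callable[[Dict[str, object]], Dict[str, object]]


def iter_posts(fetch_page: PageFetcher, q: str, since: str, until: str,
               limit: int = 100) -> Iterator[dict]:
    """Yield every post matching q in the search period by following cursors."""
    cursor: Optional[str] = None
    while True:
        params = {"q": q, "since": since, "until": until, "limit": limit}
        if cursor is not None:
            params["cursor"] = cursor
        page = fetch_page(params)
        posts = page.get("posts", [])
        yield from posts
        cursor = page.get("cursor")
        # Stop when the server signals the last page or returns nothing.
        if not cursor or not posts:
            break


def fake_fetch(params):
    """Hypothetical in-memory stand-in for the API: 250 posts, paged by offset."""
    all_posts = [{"uri": f"at://example/post/{i}"} for i in range(250)]
    start = int(params.get("cursor", 0))
    end = start + params["limit"]
    page = {"posts": all_posts[start:end]}
    if end < len(all_posts):
        page["cursor"] = str(end)
    return page


posts = list(iter_posts(fake_fetch, q="grippe",
                        since="2023-08-01T00:00:00Z",
                        until="2023-08-08T00:00:00Z"))
print(len(posts))  # 250: three pages of 100, 100 and 50 posts
```

Because the search is deterministic, rerunning the same query over the same period should return the same posts, which is what makes this kind of paging loop usable for exhaustive sampling.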
fields (selection):
uri: unique post identifier
author: contains the did, which allows retrieving the user profile
record: contains the text and time information of the message
langs: language(s) detected by the bluesky server
embedded: any embedded media (images, other posts, etc.)
in contrast to former twitter post metadata, no geoinformation
Feedgens
Labelers
no geo information
getProfiles API endpoint
bluesky post data for digital disease surveillance
Implementation of a continuous surveillance pipeline
focused on French bluesky posts (data volume constraint)
extraction using a list of keywords
extraction of
WHO Flumart
Data analysis starting from 2023-08-01
| | ILI posts | Control posts | ILI incidence |
|---|---|---|---|
| ILI posts | 1.000 | 0.863 | 0.773 |
| Control posts | 0.863 | 1.000 | 0.541 |
| ILI incidence | 0.773 | 0.541 | 1.000 |
\(Y_w:\) ILI Incidence in week \(w\)
\(X_w:\) Input features obtained in week \(w\)
\[Y_{w+1} = f(X_w, X_{w-1}, X_{w-2})\]
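The lag structure above can be sketched as a small helper that pairs the features of weeks \(w\), \(w-1\) and \(w-2\) with the incidence of week \(w+1\). The function name and the toy numbers are illustrative assumptions, not the study data or the original implementation:

```python
import numpy as np


def make_lagged_dataset(X, y, n_lags=3):
    """Build rows [X_w, X_{w-1}, X_{w-2}] paired with the target Y_{w+1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a single series as one feature column
    rows, targets = [], []
    # Week w needs n_lags - 1 earlier weeks and one later week.
    for w in range(n_lags - 1, len(y) - 1):
        lags = [X[w - k] for k in range(n_lags)]  # X_w, X_{w-1}, X_{w-2}
        rows.append(np.concatenate(lags))
        targets.append(y[w + 1])
    return np.array(rows), np.array(targets)


# Toy weekly series (made-up numbers).
weekly_posts = [10, 12, 15, 20, 30, 25]
weekly_incidence = [50, 55, 60, 80, 120, 100]

features, targets = make_lagged_dataset(weekly_posts, weekly_incidence)
print(features.shape, targets.shape)  # (3, 3) (3,)
print(features[0], targets[0])        # [15. 12. 10.] 80.0
```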
| Dataset | MAE* | RMSE |
|---|---|---|
| Training | \(23.96\) | \(33.93\) |
| Validation + | \(56.54\) | \(56.54\) |
* Mean absolute error, incidence per 100,000
+ mean over all validation runs
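For reference, the two error metrics in the table can be computed as follows. This is a minimal sketch with made-up values, not the evaluation code behind the reported numbers:

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Made-up incidences per 100,000 for three weeks.
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 230.0]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(round(mae_value, 2), round(rmse_value, 2))
```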
model agnostic feature importance procedure
random shuffling of single input features
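The shuffling procedure can be sketched as permutation importance: shuffle one input column at a time and record how much the error increases. The helper below is an illustrative assumption that works with any `predict` callable; it is not the original implementation, and the toy model is made up:

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when a single feature column is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.mean(np.abs(y - predict(X)))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            shuffled_error = np.mean(np.abs(y - predict(X_perm)))
            increases.append(shuffled_error - baseline)
        importances[j] = np.mean(increases)
    return importances


# Toy setup: the target depends only on feature 0.
toy_rng = np.random.default_rng(42)
X_toy = toy_rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0]


def toy_predict(X):
    return 3.0 * X[:, 0]


importances = permutation_importance(toy_predict, X_toy, y_toy)
print(importances)  # feature 0 clearly positive, feature 1 exactly zero
```

The procedure is model agnostic because it only needs predictions, not access to the model internals.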
json structured output option for convenient data processing
Analyze the following tweet-like message to determine if it describes the user's own influenza-like illness (ILI). ILI is defined by:
- Fever ≥38°C (100°F) **AND**
- At least one respiratory symptom (cough or sore throat) **PLUS**
- Additional systemic symptoms (headache, muscle aches, chills, fatigue, nasal congestion)
...
{ ... bluesky message dynamically inserted here ... }
Extraction using the Google Gemini API
Je suis un peu malade (fièvre, frissons, rhume). Je décide de télétravailler aujourd’hui. Et là, je réalise que les voisins font des travaux. Avec des engins électriques bruyants. Ma tête va exploser !
I'm a bit sick (fever, chills, a cold). I decide to work from home today. And then I realize that the neighbors are having work done. With noisy power tools. My head is going to explode!
fièvre,frissons,rhume,tête va exploser
Grippe aviaire : les coupes budgétaires de Trump amplifient la menace pandémique www.lepoint.fr/tiny/1-2586310 #Santé via @lepoint.fr
Avian flu: Trump's budget cuts amplify the pandemic threat www.lepoint.fr/tiny/1-2586310 #health via @lepoint.fr
nan
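A minimal sketch of how the message could be inserted into the prompt and how a JSON-structured reply could be parsed. The shortened prompt text, the JSON keys `is_ili` and `symptoms`, and the mocked reply are hypothetical; the actual call to the Gemini API is deliberately left out:

```python
import json

# Hypothetical, shortened prompt; the JSON keys are assumptions for this sketch.
PROMPT_TEMPLATE = (
    "Analyze the following tweet-like message to determine if it describes "
    "the user's own influenza-like illness (ILI).\n"
    'Respond with a JSON object with the keys "is_ili" (boolean) and '
    '"symptoms" (list of strings).\n\n'
    "Message:\n{message}"
)

FALLBACK = {"is_ili": False, "symptoms": []}


def build_prompt(message: str) -> str:
    """Insert the bluesky message into the prompt template."""
    return PROMPT_TEMPLATE.format(message=message)


def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply; fall back to a negative label if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict):
        return dict(FALLBACK)
    symptoms = data.get("symptoms")
    if not isinstance(symptoms, list):
        symptoms = []
    return {
        "is_ili": bool(data.get("is_ili", False)),
        "symptoms": [str(s) for s in symptoms],
    }


prompt = build_prompt("Je suis un peu malade (fièvre, frissons, rhume).")

# Mocked reply standing in for the model output.
raw_reply = '{"is_ili": true, "symptoms": ["fièvre", "frissons", "rhume"]}'
result = parse_response(raw_reply)
print(",".join(result["symptoms"]))
```

Falling back to a negative label on malformed output keeps a continuous pipeline running, at the cost of possibly undercounting ILI posts.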
| | LLM ILI posts | ILI incidence | Control posts |
|---|---|---|---|
| LLM ILI posts | 1.000 | 0.793 | 0.812 |
| ILI incidence | 0.793 | 1.000 | 0.557 |
| Control posts | 0.812 | 0.557 | 1.000 |
| Dataset | MAE* (LLM filtered) |
|---|---|
| Training | \(26.64\) |
| Validation | \(58.87\) |
* Mean absolute error, incidence per 100,000
bluesky = promising data source
investigate impact of LLM filtering on model performance
modeling of weekly ILI incidence based on message content
continuous data acquisition pipeline (WIP)
User localization based on profile
monitoring of bursts in user activity crucial
repeating the analysis for another country (e.g. Germany)
graph LR
subgraph kestra
dlt(dlt) --- posts
llm --- bqstaging
llm -- annotation --> bqstaging
posts --> bqstaging[<b>GBQ</b> \n stage area \n 1 table per kw]
dlt -- housekeeping --> count
dlt -- case data --> who_tables
dlt -- case data --> cdc_tables
subgraph BigQuery data lake
bqstaging
who_tables
cdc_tables
count[post counts table]
end
bqstaging --- dbt
who_tables --- dbt
cdc_tables --- dbt
count --- dbt
dbt --> bq[Google \n BigQuery]
subgraph BigQuery data warehouse
bq
end
end
bsky[bsky API] --> dlt
WHO --> dlt
CDC --> dlt
bq --> looker[Looker studio \n dashboard]
bq -- python --> stat1[Statistical analysis]
bq -- python --> stat2[Machine learning, modeling]
Open source implementation
dlt
dbt
kestra
available at: https://github.com/kantundpeterpan/bluesky_ddd_influenza